[CI/Build] Bump flashinfer to v0.6.10#41711
Conversation
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
|
Hi @arpera, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
|
Code Review
This pull request updates the FlashInfer version to 0.6.10 across the project's Docker configurations and dependency files. It also introduces conditional logic in the Dockerfile and setup.py to include the [cu13] extra for flashinfer-python when CUDA 13 is detected, facilitating support for SM100 GDN kernels. I have no feedback to provide.
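For context, a minimal sketch of what such conditional logic could look like on the setup.py side (the environment-variable check and function names below are illustrative assumptions, not the actual vLLM implementation):
```python
# Hypothetical sketch: choose the flashinfer-python requirement based on the
# CUDA major version the build targets.
import os


def cuda_major_version() -> int:
    # Assumption: the CUDA version is available via an env var like CUDA_VERSION
    # ("13.0"); a real build may instead query torch or nvcc.
    return int(os.environ.get("CUDA_VERSION", "12.8").split(".")[0])


def flashinfer_requirement(version: str = "0.6.10") -> str:
    # CUDA 13 builds pull in the [cu13] extra (needed for SM100 GDN kernels);
    # older CUDA versions keep the plain package.
    if cuda_major_version() >= 13:
        return f"flashinfer-python[cu13]=={version}"
    return f"flashinfer-python=={version}"
```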
|
FYI: 0.6.9 update - #40998 |
|
Yes, I have seen this PR #40998, thanks. It wasn't finished, so I think now v0.6.10 makes more sense. |
|
I would also like to point out that in this PR, in addition to directly integrating the new FI version v0.6.8, I made a small fix that wasn't accounted for in vLLM when integrating previous FI versions. There is also a small discussion about this issue in comments: 1, 2. Since I don't have much experience managing build dependencies in vLLM, I'd be happy to get suggestions for a more correct way to handle this in vLLM. |
|
I am noticing some potential numeric issues with the newer flashinfer versions. Specifically, the generation length for GPQA with DSv4 is significantly longer with the new versions than before (Claude suggests the model is stuck in a self-doubt loop). I am still investigating the issue, but just wanted to flag this. It may be worth doing some more eval studies before merging this. |
|
Do I understand correctly that if I have an environment with cu13 and do |
|
Yes, you understand right |
|
@arpera With more investigation, I think the issue I was hitting was not related to the newer flashinfer versions (but to something else). I tested the v0.6.10 GPQA eval with DeepSeek v4, and it looks good. I have no more concerns about upgrading. |
|
Side note: we released 0.6.10.post1 not long ago to fix an allreduce hang caused by a missing rendezvous sync group
|
@aleozlx, that is good to know, thank you! |
|
0.6.11 is out cc @mgoin |
## 📌 Description
`gen_jit_spec` adds `-DNDEBUG` only to `extra_cuda_cflags` (consumed by
`nvcc` for `.cu` files), not to `extra_cflags` (consumed by `g++` for
host-side `.cpp`). Several host-only translation units are part of
MoE/GEMM JIT specs — most notably
`csrc/nv_internal/cpp/common/logger.cpp` — and they end up compiled
without `NDEBUG` while the rest of the module is a release build.
For the TensorRT-LLM logger this matters because of:
```cpp
// csrc/nv_internal/include/tensorrt_llm/common/logger.h
#ifndef NDEBUG
Level const DEFAULT_LOG_LEVEL = DEBUG;
#else
Level const DEFAULT_LOG_LEVEL = INFO;
#endif
```
With `NDEBUG` missing on the host side, every prebuilt
`flashinfer-jit-cache` wheel ships with `Logger::level_ = DEBUG (10)`.
On Hopper this turns each MoE forward pass into a stream of
`[TensorRT-LLM][DEBUG] ... sm90_generic_mixed_moe_gemm_kernelLauncher
...` lines from the OSS CUTLASS kernel dispatcher. Verified by reading
the data-section initializer of `Logger::Logger()` in the released
`flashinfer-jit-cache==0.6.10+cu130`
`fused_moe_{90,100,103,120,trtllm_sm100}.so` — all five start `Logger`
with `DEFAULT_LOG_LEVEL=10` and `level_=10`, even though the same wheels
carry no `.debug_*` sections (i.e. they are otherwise release-built).
The fix is one line: also append `-DNDEBUG` to the host `cflags` when
not in debug mode. The `flashinfer-jit-cache` wheel build picks this up
automatically and the prebuilt logger flips back to `INFO`.
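For illustration, a minimal sketch of the intended behavior in `flashinfer/jit/core.py` (the function name and signature below are assumptions made for the sketch; only the idea of appending `-DNDEBUG` to both flag lists in release mode comes from this PR):
```python
# Sketch: in release mode, define NDEBUG for host (g++) compiles as well,
# not only for device (nvcc) compiles.
import os


def build_jit_flags(user_cflags=(), user_cuda_cflags=(), debug=None):
    if debug is None:
        # FLASHINFER_JIT_DEBUG=1 selects a debug build (see the tests below).
        debug = os.environ.get("FLASHINFER_JIT_DEBUG", "0") == "1"

    extra_cflags = list(user_cflags)            # g++ flags for host-side .cpp files
    extra_cuda_cflags = list(user_cuda_cflags)  # nvcc flags for .cu files

    if not debug:
        extra_cuda_cflags.append("-DNDEBUG")  # already happened before this change
        extra_cflags.append("-DNDEBUG")       # the fix: logger.cpp now sees NDEBUG too
    return extra_cflags, extra_cuda_cflags
```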
## 🔍 Related Issues
Initially this bug was observed during integration of FI v0.6.10 into
vLLM: [[CI/Build] Bump flashinfer to v0.6.10
#41711](vllm-project/vllm#41711).
There is a CI job log failure due to this issue:
[buildkite/ci/pr/distributed-tests-2-gpus-h100](https://buildkite.com/vllm/ci/builds/64532#019df966-e67d-4c27-af0e-76b00bc496e5).
Surfaced while debugging a downstream CI step that produced a 2.9 GB log
dominated by TRT-LLM debug prints from `fused_moe_90.so`. No FlashInfer
issue tracking this yet — happy to file one alongside this PR if useful.
## 🚀 Pull Request Checklist
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`.
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`pytest tests/test_jit_cpp_ext.py`).
Two regression tests added in `tests/test_jit_cpp_ext.py`, mirroring the
existing `test_debug_jit_uses_sccache_compatible_nvcc_device_debug_flag`
style:
```
pytest tests/test_jit_cpp_ext.py -v
```
```
test_release_jit_propagates_ndebug_to_host_cflags PASSED
test_debug_jit_does_not_propagate_ndebug PASSED
```
The first asserts that a release build
(`FLASHINFER_JIT_DEBUG`/`FLASHINFER_JIT_VERBOSE` unset) puts `-DNDEBUG`
in **both** `spec.extra_cflags` and `spec.extra_cuda_cflags`. The second
locks in symmetry: with `FLASHINFER_JIT_DEBUG=1` neither list contains
`-DNDEBUG`. Without the fix, the first test fails on `assert "-DNDEBUG"
in spec.extra_cflags`.
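Conceptually, the two assertions reduce to something like the following sketch, written against the hypothetical `build_jit_flags` helper above rather than the real `gen_jit_spec` call, whose full argument list is omitted here:
```python
# Hypothetical pytest sketch of the two regression checks described above.
# Assumes build_jit_flags from the earlier sketch is importable as shown.
from flags_sketch import build_jit_flags  # hypothetical module name


def test_release_jit_propagates_ndebug_to_host_cflags(monkeypatch):
    # Release mode: FLASHINFER_JIT_DEBUG unset -> NDEBUG on both flag lists.
    monkeypatch.delenv("FLASHINFER_JIT_DEBUG", raising=False)
    cflags, cuda_cflags = build_jit_flags()
    assert "-DNDEBUG" in cflags       # host side (this is the fix)
    assert "-DNDEBUG" in cuda_cflags  # device side (true before the fix as well)


def test_debug_jit_does_not_propagate_ndebug(monkeypatch):
    # Debug mode: neither flag list should define NDEBUG.
    monkeypatch.setenv("FLASHINFER_JIT_DEBUG", "1")
    cflags, cuda_cflags = build_jit_flags()
    assert "-DNDEBUG" not in cflags
    assert "-DNDEBUG" not in cuda_cflags
```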
## Reviewer Notes
Single-line behavior change in `flashinfer/jit/core.py`. No effect on
debug builds. Prebuilt wheels rebuilt from this commit will pick up the
change automatically — no schema/version bump needed.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * JIT-compiled code now includes optimized compilation flags in release mode for improved performance.
* **Tests**
  * Added test coverage for proper compilation flag handling between debug and release build modes.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
|
The current up-to-date FI version (v0.6.11) has a problem: some of the kernels were compiled with DEFAULT_LOG_LEVEL=DEBUG. Because of this, some of the tests in our CI run failed, for example buildkite/ci/pr/gpqa-eval-gpt-oss-h100. I managed to fix that issue on FI's side and merged the fix upstream: fix(jit): propagate -DNDEBUG to host-side cflags #3278. That leaves a question I would like to ask you. As I understand it, we now have two options.
What do you think? Should we wait until the next release, or should we apply a temporary solution in vLLM and remove it later? |
|
@arpera Can you request a patched release 0.6.11.post1 with the fix you just merged? That should be a reasonable thing to ask. Also, is the failure caused by the log level itself, or by some other issue that was hard to check due to the verbose log level? |
Thanks for the suggestion! I will ask the FI team to do this.
It was hard to check whether there were other issues besides that one because the logs were several GB each. I see that some of the failed CI jobs are flaky and did not fail because of the update. Nevertheless, there is still a possibility that the FI version update caused something else in our CI. |
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Purpose
Bump FlashInfer from v0.6.8.post1 to v0.6.10. Add the flashinfer-python[cu13] extra for cu13 users.